{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Fitting model on imbalanced datasets and how to fight bias\n\nThis example illustrates the problem induced by learning on datasets having\nimbalanced classes. Subsequently, we compare different approaches alleviating\nthese negative effects.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Problem definition\n\nWe are dropping the following features:\n\n- \"fnlwgt\": this feature was created while studying the \"adult\" dataset.\n  Thus, we will not use this feature which is not acquired during the survey.\n- \"education-num\": it is encoding the same information than \"education\".\n  Thus, we are removing one of these 2 features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_openml\n\ndf, y = fetch_openml(\"adult\", version=2, as_frame=True, return_X_y=True)\ndf = df.drop(columns=[\"fnlwgt\", \"education-num\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The \"adult\" dataset as a class ratio of about 3:1\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "classes_count = y.value_counts()\nclasses_count"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This dataset is only slightly imbalanced. To better highlight the effect of\nlearning from an imbalanced dataset, we will increase its ratio to 30:1\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.datasets import make_imbalance\n\nratio = 30\ndf_res, y_res = make_imbalance(\n    df,\n    y,\n    sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},\n)\ny_res.value_counts()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will perform a cross-validation evaluation to get an estimate of the test\nscore.\n\nAs a baseline, we could use a classifier which will always predict the\nmajority class independently of the features provided.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import cross_validate\nfrom sklearn.dummy import DummyClassifier\n\ndummy_clf = DummyClassifier(strategy=\"most_frequent\")\nscoring = [\"accuracy\", \"balanced_accuracy\"]\ncv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)\nprint(f\"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Instead of using the accuracy, we can use the balanced accuracy which will\ntake into account the balancing issue.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\n    f\"Balanced accuracy score of a dummy classifier: \"\n    f\"{cv_result['test_balanced_accuracy'].mean():.3f}\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Strategies to learn from an imbalanced dataset\nWe will use a dictionary and a list to continuously store the results of\nour experiments and show them as a pandas dataframe.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "index = []\nscores = {\"Accuracy\": [], \"Balanced accuracy\": []}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Dummy baseline\n\nBefore to train a real machine learning model, we can store the results\nobtained with our :class:`~sklearn.dummy.DummyClassifier`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\nindex += [\"Dummy classifier\"]\ncv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Linear classifier baseline\n\nWe will create a machine learning pipeline using a\n:class:`~sklearn.linear_model.LogisticRegression` classifier. In this regard,\nwe will need to one-hot encode the categorical columns and standardized the\nnumerical columns before to inject the data into the\n:class:`~sklearn.linear_model.LogisticRegression` classifier.\n\nFirst, we define our numerical and categorical pipelines.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.pipeline import make_pipeline\n\nnum_pipe = make_pipeline(\n    StandardScaler(), SimpleImputer(strategy=\"mean\", add_indicator=True)\n)\ncat_pipe = make_pipeline(\n    SimpleImputer(strategy=\"constant\", fill_value=\"missing\"),\n    OneHotEncoder(handle_unknown=\"ignore\"),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then, we can create a preprocessor which will dispatch the categorical\ncolumns to the categorical pipeline and the numerical columns to the\nnumerical pipeline\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.compose import make_column_transformer\nfrom sklearn.compose import make_column_selector as selector\n\npreprocessor_linear = make_column_transformer(\n    (num_pipe, selector(dtype_include=\"number\")),\n    (cat_pipe, selector(dtype_include=\"category\")),\n    n_jobs=2,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we connect our preprocessor with our\n:class:`~sklearn.linear_model.LogisticRegression`. We can then evaluate our\nmodel.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.linear_model import LogisticRegression\n\nlr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "index += [\"Logistic regression\"]\ncv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that our linear model is learning slightly better than our dummy\nbaseline. However, it is impacted by the class imbalance.\n\nWe can verify that something similar is happening with a tree-based model\nsuch as :class:`~sklearn.ensemble.RandomForestClassifier`. With this type of\nclassifier, we will not need to scale the numerical data, and we will only\nneed to ordinal encode the categorical data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import OrdinalEncoder\nfrom sklearn.ensemble import RandomForestClassifier\n\nnum_pipe = SimpleImputer(strategy=\"mean\", add_indicator=True)\ncat_pipe = make_pipeline(\n    SimpleImputer(strategy=\"constant\", fill_value=\"missing\"),\n    OrdinalEncoder(handle_unknown=\"use_encoded_value\", unknown_value=-1),\n)\n\npreprocessor_tree = make_column_transformer(\n    (num_pipe, selector(dtype_include=\"number\")),\n    (cat_pipe, selector(dtype_include=\"category\")),\n    n_jobs=2,\n)\n\nrf_clf = make_pipeline(\n    preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "index += [\"Random forest\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The :class:`~sklearn.ensemble.RandomForestClassifier` is as well affected by\nthe class imbalanced, slightly less than the linear model. Now, we will\npresent different approach to improve the performance of these 2 models.\n\n### Use `class_weight`\n\nMost of the models in `scikit-learn` have a parameter `class_weight`. This\nparameter will affect the computation of the loss in linear model or the\ncriterion in the tree-based model to penalize differently a false\nclassification from the minority and majority class. We can set\n`class_weight=\"balanced\"` such that the weight applied is inversely\nproportional to the class frequency. We test this parametrization in both\nlinear model and tree-based model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "lr_clf.set_params(logisticregression__class_weight=\"balanced\")\n\nindex += [\"Logistic regression with balanced class weights\"]\ncv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rf_clf.set_params(randomforestclassifier__class_weight=\"balanced\")\n\nindex += [\"Random forest with balanced class weights\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that using `class_weight` was really effective for the linear\nmodel, alleviating the issue of learning from imbalanced classes. However,\nthe :class:`~sklearn.ensemble.RandomForestClassifier` is still biased toward\nthe majority class, mainly due to the criterion which is not suited enough to\nfight the class imbalance.\n\n### Resample the training set during learning\n\nAnother way is to resample the training set by under-sampling or\nover-sampling some of the samples. `imbalanced-learn` provides some samplers\nto do such processing.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler\nfrom imblearn.under_sampling import RandomUnderSampler\n\nlr_clf = make_pipeline_with_sampler(\n    preprocessor_linear,\n    RandomUnderSampler(random_state=42),\n    LogisticRegression(max_iter=1000),\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "index += [\"Under-sampling + Logistic regression\"]\ncv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rf_clf = make_pipeline_with_sampler(\n    preprocessor_tree,\n    RandomUnderSampler(random_state=42),\n    RandomForestClassifier(random_state=42, n_jobs=2),\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "index += [\"Under-sampling + Random forest\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Applying a random under-sampler before the training of the linear model or\nrandom forest, allows to not focus on the majority class at the cost of\nmaking more mistake for samples in the majority class (i.e. decreased\naccuracy).\n\nWe could apply any type of samplers and find which sampler is working best\non the current dataset.\n\nInstead, we will present another way by using classifiers which will apply\nsampling internally.\n\n### Use of specific balanced algorithms from imbalanced-learn\n\nWe already showed that random under-sampling can be effective on decision\ntree. However, instead of under-sampling once the dataset, one could\nunder-sample the original dataset before to take a bootstrap sample. This is\nthe base of the :class:`imblearn.ensemble.BalancedRandomForestClassifier` and\n:class:`~imblearn.ensemble.BalancedBaggingClassifier`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.ensemble import BalancedRandomForestClassifier\n\nrf_clf = make_pipeline(\n    preprocessor_tree,\n    BalancedRandomForestClassifier(random_state=42, n_jobs=2),\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "index += [\"Balanced random forest\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The performance with the\n:class:`~imblearn.ensemble.BalancedRandomForestClassifier` is better than\napplying a single random under-sampling. We will use a gradient-boosting\nclassifier within a :class:`~imblearn.ensemble.BalancedBaggingClassifier`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.experimental import enable_hist_gradient_boosting  # noqa\nfrom sklearn.ensemble import HistGradientBoostingClassifier\nfrom imblearn.ensemble import BalancedBaggingClassifier\n\nbag_clf = make_pipeline(\n    preprocessor_tree,\n    BalancedBaggingClassifier(\n        base_estimator=HistGradientBoostingClassifier(random_state=42),\n        n_estimators=10,\n        random_state=42,\n        n_jobs=2,\n    ),\n)\n\nindex += [\"Balanced bag of histogram gradient boosting\"]\ncv_result = cross_validate(bag_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This last approach is the most effective. The different under-sampling allows\nto bring some diversity for the different GBDT to learn and not focus on a\nportion of the majority class.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}